Exploiting Multiply Annotated Corpora in Biomedical Information Extraction Tasks

نویسندگان

  • Barry Haddow
  • Beatrice Alex
چکیده

This paper discusses the problem of utilising multiply annotated data in training biomedical information extraction systems. Two corpora, annotated with entities and relations, and containing a number of multiply annotated documents, are used to train named entity recognition and relation extraction systems. Several methods of automatically combining the multiple annotations to produce a single annotation are compared, but none produces better results than simply picking one of the annotated versions at random. It is also shown that adding extra singly annotated documents produces faster performance gains than adding extra multiply annotated documents.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Cross-corpus Training with Treelstm for the Extraction of Biomedical Relationships from Text

A bottleneck problem in machine learning-based relationship extraction (RE) algorithms, and particularly of deep learning-based ones, is the availability of training data in the form of annotated corpora. For specific domains, such as biomedicine, the long time and high expertise required for the development of manually annotated corpora explain that most of the existing one are relatively smal...

متن کامل

Cross-corpus Training with Treelstm for the Extraction of Biomedical Relationships from Text

A bottleneck problem in machine learning-based relationship extraction (RE) algorithms, and particularly of deep learning-based ones, is the availability of training data in the form of annotated corpora. For specific domains, such as biomedicine, the long time and high expertise required for the development of manually annotated corpora explain that most of the existing one are relatively smal...

متن کامل

Exploiting large corpora: A circular process of partial syntactic analysis, corpus query and extraction of lexicographic information

Our approach follows the work of Eckle-Kohler (1999) who used a regular grammar to extract lexicographic information from text corpora. We employ a system that allows to improve her querybased grammar especially with respect to recall and speed without reducing accuracy. In contrast to Eckle-Kohler (1999), we do not attempt to parse a whole sentence or phrase at once during the extraction proce...

متن کامل

Annotating Web pages for the needs of Web Information Extraction Applications

This paper outlines our approach to the creation of annotated corpora for the purposes of Web Information Extraction, and presents the Web Annotation tool. This tool enables the annotation of Web pages from different domains and for different information extraction tasks providing a user-friendly interface to human annotators. Annotated information is stored in a representation format that can ...

متن کامل

The ITI TXM Corpora: Tissue Expressions and Protein-Protein Interactions

We report on two large corpora of semantically annotated full-text biomedical research papers created in order to develop information extraction (IE) tools for the TXM project. Both corpora have been annotated with a range of entities (CellLine, Complex, DevelopmentalStage, Disease, DrugCompound, ExperimentalMethod, Fragment, Fusion, GOMOP, Gene, Modification, mRNAcDNA, Mutant, Protein, Tissue)...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2008